graph LR
A["Install vLLM"] --> B["Load / configure<br/>model"]
B --> C["Start OpenAI-compatible<br/>API server"]
C --> D["Query from<br/>Python / curl"]
D --> E["Deploy to<br/>production"]
style A fill:#ffce67,stroke:#333
style B fill:#ffce67,stroke:#333
style C fill:#6cc3d5,stroke:#333,color:#fff
style D fill:#6cc3d5,stroke:#333,color:#fff
style E fill:#56cc9d,stroke:#333,color:#fff
Deploying and Serving LLMs with vLLM
End-to-end guide: deploy and serve LLMs at scale with vLLM for high-throughput, low-latency inference
Keywords: vLLM, LLM serving, model deployment, inference optimization, PagedAttention, OpenAI API, batching, GPU inference, production LLM

Introduction
Serving Large Language Models (LLMs) in production requires more than just loading a model and running inference. You need high throughput, low latency, and efficient GPU memory usage to handle real-world traffic.
vLLM is an open-source library designed specifically for this purpose. It makes LLM serving:
- Fast (up to 24x higher throughput than vanilla HuggingFace Transformers serving)
- Memory-efficient (via PagedAttention)
- Production-ready (OpenAI-compatible API server)
- Easy to deploy (Docker, Kubernetes, cloud)
In this tutorial, we will walk through a complete pipeline:
- Install and configure vLLM
- Serve a model with the OpenAI-compatible API
- Query the model from Python
- Optimize for production deployment
What is vLLM?
vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. Key features include:
- PagedAttention: Efficient memory management inspired by OS virtual memory, reducing GPU memory waste
- Continuous batching: Dynamically batches incoming requests for maximum throughput
- OpenAI-compatible API: Drop-in replacement for OpenAI API endpoints
- Tensor parallelism: Distribute models across multiple GPUs
- Support for many models: Llama, Mistral, Qwen, Phi, Gemma, and more
graph TD
A["vLLM Engine"] --> B["PagedAttention<br/>Memory efficiency"]
A --> C["Continuous Batching<br/>Max throughput"]
A --> D["OpenAI-compatible API<br/>Drop-in replacement"]
A --> E["Tensor Parallelism<br/>Multi-GPU support"]
A --> F["Wide Model Support<br/>Llama, Mistral, Qwen..."]
style A fill:#56cc9d,stroke:#333,color:#fff
style B fill:#6cc3d5,stroke:#333,color:#fff
style C fill:#6cc3d5,stroke:#333,color:#fff
style D fill:#6cc3d5,stroke:#333,color:#fff
style E fill:#6cc3d5,stroke:#333,color:#fff
style F fill:#6cc3d5,stroke:#333,color:#fff
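To make the PagedAttention idea concrete, here is a toy sketch of block-based KV-cache allocation. This is purely illustrative of the concept (fixed-size blocks handed out on demand, like OS pages), not vLLM's actual implementation; the block size and class names are ours.

```python
# Toy illustration of PagedAttention's core idea: the KV cache is carved into
# fixed-size blocks, and each sequence holds only the blocks it actually
# needs, instead of one large contiguous pre-allocation per request.
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)

class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def blocks_needed(self, num_tokens: int) -> int:
        # ceil-divide: a 40-token sequence needs 3 blocks of 16 tokens
        return -(-num_tokens // BLOCK_SIZE)

    def allocate(self, num_tokens: int) -> list[int]:
        n = self.blocks_needed(num_tokens)
        if n > len(self.free_blocks):
            raise MemoryError("KV cache exhausted")
        return [self.free_blocks.pop() for _ in range(n)]

    def free(self, blocks: list[int]) -> None:
        self.free_blocks.extend(blocks)

alloc = BlockAllocator(num_blocks=1024)
seq_blocks = alloc.allocate(num_tokens=40)  # holds 3 blocks, not a max-length slab
print(len(seq_blocks))  # 3
alloc.free(seq_blocks)  # blocks return to the pool for other requests
```

Because memory is reclaimed at block granularity as soon as a request finishes, many more concurrent sequences fit on the same GPU, which is what enables continuous batching.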
Hardware Requirements
vLLM is designed for GPU inference. Minimum requirements depend on your model size:
| Model Size | Minimum GPU VRAM | Recommended GPU |
|---|---|---|
| 0.5B–3B | 4 GB | RTX 3060 / T4 |
| 7B–8B | 16 GB | RTX 4090 / A10 |
| 13B | 24 GB | A10 / A100 |
| 70B | 80 GB+ | A100 / H100 (multi-GPU) |
For CPU-only machines, consider using Ollama or llama.cpp instead.
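The minimums in the table above follow from a rough rule of thumb: weights take roughly parameters × bytes-per-parameter, plus overhead for the KV cache, activations, and CUDA context. A back-of-envelope sketch (our own illustrative heuristic, not a vLLM API):

```python
def estimate_vram_gb(num_params_b: float, bytes_per_param: float = 2.0,
                     overhead: float = 1.25) -> float:
    """Rough VRAM estimate: weights (params x bytes) plus ~25% overhead
    for KV cache, activations, and CUDA context. Illustrative only."""
    weights_gb = num_params_b * bytes_per_param  # 1B params x 2 bytes (fp16/bf16) ~ 2 GB
    return weights_gb * overhead

print(round(estimate_vram_gb(7), 1))                       # 7B in fp16: 17.5
print(round(estimate_vram_gb(7, bytes_per_param=0.5), 1))  # 7B 4-bit quantized: 4.4
```

This is why a 7B–8B model wants a 16 GB card in fp16 but can squeeze onto much smaller GPUs when quantized.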
Installation
Install vLLM
pip install vllm
Install with CUDA support (recommended)
pip install vllm[cuda]
Verify installation
import vllm
print(vllm.__version__)
Offline Inference (Batch Processing)
Use vLLM for fast batch inference without starting a server.
Basic Example
from vllm import LLM, SamplingParams
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
sampling_params = SamplingParams(
temperature=0.7,
max_tokens=256,
)
prompts = [
"Explain machine learning in simple terms.",
"What is the difference between AI and ML?",
"Write a Python function to reverse a string.",
]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
print(output.outputs[0].text)
print("---")
Chat-style Inference
from vllm import LLM, SamplingParams
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
messages = [
{"role": "system", "content": "You are a helpful AI assistant."},
{"role": "user", "content": "Explain vLLM in simple terms."},
]
sampling_params = SamplingParams(
temperature=0.7,
max_tokens=256,
)
outputs = llm.chat(messages=[messages], sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
Serving with OpenAI-Compatible API
vLLM provides an API server that is fully compatible with the OpenAI API format.
graph LR
A["vLLM Server<br/>(port 8000)"] --> B["OpenAI-compatible<br/>/v1/chat/completions"]
B --> C["curl"]
B --> D["Python requests"]
B --> E["OpenAI Python client"]
B --> F["Any OpenAI-compatible<br/>application"]
style A fill:#56cc9d,stroke:#333,color:#fff
style B fill:#6cc3d5,stroke:#333,color:#fff
style C fill:#f8f9fa,stroke:#333
style D fill:#f8f9fa,stroke:#333
style E fill:#f8f9fa,stroke:#333
style F fill:#f8f9fa,stroke:#333
Start the Server
vllm serve Qwen/Qwen2.5-0.5B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--api-key your-secret-key \
--served-model-name my-model \
--chat-template ./chat_template.jinja \
--gpu-memory-utilization 0.90
Key options explained:
- --served-model-name: Sets the model name exposed in the API (clients use this name in requests instead of the full HuggingFace path)
- --chat-template: Path to a Jinja2 chat template file for formatting chat messages (useful for custom or fine-tuned models)
- --gpu-memory-utilization: Fraction of GPU memory to use (0.0–1.0, default 0.9). Increase for larger models, decrease to leave room for other processes
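To make --chat-template concrete: a chat template renders the messages list into the single prompt string the model was trained on. The sketch below hand-rolls a ChatML-style format (the family Qwen models use) in plain Python purely for illustration; in practice the template lives in the Jinja2 file, and this may not match your model's exact template.

```python
def apply_chatml_template(messages: list[dict]) -> str:
    """Illustrative Python equivalent of a ChatML-style Jinja2 chat template."""
    prompt = ""
    for m in messages:
        prompt += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    # Generation prompt: the model continues from the assistant header
    return prompt + "<|im_start|>assistant\n"

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi!"},
]
print(apply_chatml_template(messages))
```

Serving a fine-tuned model with the wrong template is a common source of degraded output, which is why passing the template you trained with matters.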
Start with Custom Parameters
vllm serve Qwen/Qwen2.5-0.5B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--api-key your-secret-key \
--served-model-name my-model \
--chat-template ./chat_template.jinja \
--max-model-len 4096 \
--gpu-memory-utilization 0.90 \
--dtype auto
Verify the Server
curl http://localhost:8000/v1/models
Querying the API
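Model loading can take a while, so scripts often need to wait until the server answers before sending requests. A minimal stdlib-only polling sketch (the helper name, timings, and the your-secret-key API key are our assumptions, not part of vLLM):

```python
import time
import urllib.error
import urllib.request

def wait_for_server(url: str, api_key: str = "your-secret-key",
                    timeout: float = 60.0, interval: float = 2.0) -> bool:
    """Poll an endpoint until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            req = urllib.request.Request(
                url, headers={"Authorization": f"Bearer {api_key}"})
            with urllib.request.urlopen(req, timeout=interval) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet (or still loading the model); retry
        time.sleep(interval)
    return False

ready = wait_for_server("http://localhost:8000/v1/models", timeout=5)
print("ready" if ready else "not reachable yet")
```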
Using curl
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-secret-key" \
-d '{
"model": "my-model",
"messages": [
{"role": "user", "content": "What is vLLM?"}
],
"temperature": 0.7,
"max_tokens": 256
}'
Using Python (requests)
import requests
response = requests.post(
"http://localhost:8000/v1/chat/completions",
headers={"Authorization": "Bearer your-secret-key"},
json={
"model": "my-model",
"messages": [
{"role": "user", "content": "Explain PagedAttention."}
],
"temperature": 0.7,
"max_tokens": 256,
}
)
print(response.json()["choices"][0]["message"]["content"])
Using OpenAI Python Client (Recommended)
Since vLLM is OpenAI-compatible, you can use the official OpenAI client:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="your-secret-key",
)
response = client.chat.completions.create(
model="my-model",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is continuous batching?"},
],
temperature=0.7,
max_tokens=256,
)
print(response.choices[0].message.content)
Serving Custom / Fine-tuned Models
If you fine-tuned a small LLM with Unsloth and exported it to GGUF (e.g., gguf_model_small), here is how to serve it with vLLM.
vLLM natively supports GGUF files — no conversion required. See the official vLLM GGUF documentation for full details.
Note: GGUF support in vLLM is experimental and under-optimized. Currently, only single-file GGUF models are supported. If you have a multi-file GGUF model, use gguf-split to merge them first.
graph TD
A["Fine-tuned model"] --> B{"Export format?"}
B -->|"GGUF"| C["Serve GGUF directly<br/>with vLLM"]
B -->|"HF safetensors"| D["Serve HF format<br/>with vLLM"]
B -->|"LoRA adapter"| E["Serve with<br/>--enable-lora"]
C --> F["OpenAI-compatible API"]
D --> F
E --> F
style A fill:#f8f9fa,stroke:#333
style C fill:#56cc9d,stroke:#333,color:#fff
style D fill:#56cc9d,stroke:#333,color:#fff
style E fill:#56cc9d,stroke:#333,color:#fff
style F fill:#6cc3d5,stroke:#333,color:#fff
Option A: Serve a GGUF file directly
Step 1: Prepare Your GGUF Model
After fine-tuning with Unsloth and exporting to GGUF, you should have a file like:
gguf_model_small/
├── added_tokens.json
├── chat_template.jinja
├── config.json
├── generation_config.json
├── merges.txt
├── model.safetensors
├── special_tokens_map.json
├── tokenizer.json
├── tokenizer_config.json
├── unsloth.BF16.gguf
├── unsloth.Q4_K_M.gguf
└── vocab.json
Step 2: Serve with vLLM
Point vLLM directly at the GGUF file. Use --tokenizer to specify the base model’s tokenizer (recommended over the GGUF-embedded tokenizer for stability):
vllm serve ./gguf_model_small/unsloth.Q4_K_M.gguf \
--tokenizer ./gguf_model_small \
--served-model-name my-finetuned-model \
--chat-template ./chat_template.jinja \
--host 0.0.0.0 \
--port 8000 \
--api-key your-secret-key \
--gpu-memory-utilization 0.90 \
--max-model-len 2048
You can also load GGUF models from HuggingFace using the repo_id:quant_type format:
vllm serve unsloth/Qwen3-0.6B-GGUF:Q4_K_M \
--tokenizer Qwen/Qwen3-0.6B \
--served-model-name qwen3-gguf \
--host 0.0.0.0 \
--port 8000 \
--api-key your-secret-key \
--gpu-memory-utilization 0.90
Add --tensor-parallel-size 2 to distribute across multiple GPUs:
vllm serve unsloth/Qwen3-0.6B-GGUF:Q4_K_M \
--tokenizer Qwen/Qwen3-0.6B \
--tensor-parallel-size 2 \
--api-key your-secret-key
Step 3: Verify and Query
curl http://localhost:8000/v1/models \
-H "Authorization: Bearer your-secret-key"
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="your-secret-key",
)
response = client.chat.completions.create(
model="my-finetuned-model",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is fine-tuning?"},
],
temperature=0.7,
max_tokens=256,
)
print(response.choices[0].message.content)
Option B: Serve in Hugging Face format (safetensors)
If you prefer maximum compatibility (e.g., with LoRA adapters or features not yet supported with GGUF), export in HF format instead:
# During fine-tuning with Unsloth, save in HF format
model.save_pretrained_merged("hf_model_small", tokenizer)
Then serve:
vllm serve ./hf_model_small \
--host 0.0.0.0 \
--port 8000 \
--api-key your-secret-key \
--served-model-name my-finetuned-model \
--chat-template ./chat_template.jinja \
--gpu-memory-utilization 0.90 \
--dtype auto \
--max-model-len 2048
Serve a LoRA Adapter (Without Merging)
If you prefer to keep LoRA weights separate, vLLM supports serving LoRA adapters on top of a base model:
vllm serve Qwen/Qwen2.5-0.5B-Instruct \
--enable-lora \
--lora-modules my-lora=./lora_model \
--host 0.0.0.0 \
--port 8000 \
--api-key your-secret-key
Then query with the LoRA model name:
response = client.chat.completions.create(
model="my-lora",
messages=[{"role": "user", "content": "Hello!"}],
)
Docker Deployment
Deploy vLLM in a container for production environments.
graph LR
A["vLLM Docker Image<br/>(vllm/vllm-openai)"] --> B["GPU Container<br/>(--gpus all)"]
B --> C["Model loaded<br/>in container"]
C --> D["Expose port 8000"]
D --> E["Production traffic"]
style A fill:#ffce67,stroke:#333
style B fill:#6cc3d5,stroke:#333,color:#fff
style C fill:#6cc3d5,stroke:#333,color:#fff
style D fill:#56cc9d,stroke:#333,color:#fff
style E fill:#56cc9d,stroke:#333,color:#fff
Dockerfile
FROM vllm/vllm-openai:latest
# Note: exec-form CMD does not expand environment variables, so pass the model name directly
CMD ["--model", "Qwen/Qwen2.5-0.5B-Instruct", "--host", "0.0.0.0", "--port", "8000"]
Run with Docker
docker run --gpus all \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model Qwen/Qwen2.5-0.5B-Instruct \
--host 0.0.0.0 \
--port 8000
Performance Optimization Tips
- GPU memory utilization: Set --gpu-memory-utilization 0.90 to maximize GPU usage (range: 0.0–1.0)
- Served model name: Use --served-model-name for cleaner API model names instead of long HuggingFace paths
- Chat template: Use --chat-template to apply a custom Jinja2 chat template for fine-tuned models
- Quantization: Use AWQ or GPTQ quantized models to reduce VRAM
- Tensor parallelism: Use --tensor-parallel-size N for multi-GPU setups
- Max model length: Reduce --max-model-len if you don't need long contexts
- Continuous batching: Enabled by default, handles concurrent requests efficiently
- Streaming: Use stream=True for real-time token generation
vLLM vs Other Serving Solutions
| Feature | vLLM | Ollama | TGI | llama.cpp |
|---|---|---|---|---|
| Throughput | Very High | Medium | High | Low-Medium |
| GPU Required | Yes | Optional | Yes | Optional |
| OpenAI API | Yes | Partial | Yes | Partial |
| Multi-GPU | Yes | No | Yes | No |
| Ease of Use | Medium | Easy | Medium | Medium |
| Best For | Production | Local Dev | Production | Edge/CPU |
Conclusion
vLLM is the go-to solution for high-performance LLM serving in production:
- Serves models with an OpenAI-compatible API
- Handles high-concurrency with continuous batching
- Optimizes GPU memory with PagedAttention
- Supports custom and fine-tuned models
- Deploys easily with Docker and Kubernetes
This workflow is perfect for:
- Production AI APIs
- Enterprise LLM platforms
- High-traffic chatbot backends
- Multi-model serving infrastructure
Read More
- Combine with a RAG pipeline (LangChain + vLLM)
- Add load balancing with Nginx or Traefik
- Deploy on Kubernetes with GPU node pools
- Monitor with Prometheus + Grafana
- Serve multiple models with model routing